[UR][L0v2] Migrate discrete buffer through host when P2P is not accessible #22010
Open
ldorau wants to merge 3 commits into
Please review @intel/unified-runtime-reviewers-level-zero
mateuszpn
reviewed
May 14, 2026
// Migrate buffer through the host: copy from the current device to a
// temporary host buffer, then from host to the target device.
auto bufferSize = getSize();
std::vector<char> hostBuf(bufferSize);
nit: it may be worth using a USM allocation here instead of a heap allocation, as in line 100.
pbalcer
reviewed
May 14, 2026
Comment on lines +369 to +372
for (uint32_t i = 0; i < waitListView.num; i++) {
  ZE2UR_CALL_THROWS(zeEventHostSynchronize,
                    (waitListView.handles[i], UINT64_MAX));
}
I don't think this will work. The operation also needs to be ordered with respect to the command list itself, so something like this would be better:
if (numWaitEvents > 0) {
  ZE2UR_CALL(zeCommandListAppendWaitOnEvents,
             (zeCommandList.get(), numWaitEvents, pWaitEvents));
}
ZE2UR_CALL(zeCommandListHostSynchronize, (zeCommandList.get(), UINT64_MAX));
auto bufferSize = getSize();
std::vector<char> hostBuf(bufferSize);

UR_CALL_THROWS(synchronousZeCopy(hContext, activeAllocationDevice,
I don't like the fact that this is synchronous. Can you explore what it would take to make it async? I think we'd need to keep the allocation somewhere.
Changed. Is it OK now?
@mateuszpn @pbalcer re-review please
9727548 to 1e9d552
…sible

When a buffer on a discrete GPU needs to be accessed from a different
device and P2P access is not enabled, migrate the data through a USM
HOST staging buffer instead of returning UR_RESULT_ERROR_UNSUPPORTED_FEATURE.

The migration uses a two-step copy:
1. Synchronous device->host copy using the source device's own command
   list (the destination device cannot reach source device memory
   without P2P).
2. Async host->device copy enqueued on the caller's command list (host
   memory is accessible by all devices, so this is safe).

Before the device->host copy, any pending operations on the caller's
command list are ordered and drained via zeCommandListAppendWaitOnEvents
+ zeCommandListHostSynchronize, ensuring prior kernel writes to the
source buffer are visible. A fully synchronous fallback is used when
no command list is available (e.g. urMemGetNativeHandle).

Only one staging buffer is kept alive at a time: it is released at the
start of the next migration after zeCommandListHostSynchronize confirms
the previous async copy has completed.

A new ensureDeviceAlloc helper allocates the destination device buffer
without the activeAllocationDevice side-effect of allocateOnDevice,
so the active-device state is only updated after the async copy is
successfully enqueued.

Fixes: intel#22007
Fixes: intel#22008

Signed-off-by: Lukasz Dorau <lukasz.dorau@intel.com>
Add four conformance tests exercising discrete buffers accessed from
two different device queues when P2P access is not available.

Tests covering the async migration path (cmdList != nullptr, triggered
by urEnqueueMem* operations):
- AsyncFillThenReadOnSecondQueueWithWait: fills a buffer on queues[0]
  and reads it on queues[1] using an explicit event dependency.
- PingPongFillBetweenTwoDeviceQueues: alternates fills between queues[0]
  and queues[1], each read on the opposite queue using event
  dependencies.
- ChainedAsyncOpsAcrossQueuesWithEvents: chains fill, blocking write,
  and read across two queues using cross-queue events.

Test covering the synchronous fallback path (cmdList == nullptr,
triggered by urMemGetNativeHandle):
- SyncFallbackMigrationViaNativeHandle: fills the buffer on device 0,
  calls urMemGetNativeHandle for device 1 to trigger synchronous
  host-staged migration, then verifies the data on device 1.

All tests add an explicit queues.size() < 2 guard (GTEST_SKIP) in case
the fixture minimum-device requirement changes, and cross-queue ordering
is expressed with events throughout to properly exercise the async
migration path.

A dedicated L0 v2 adapter runner (discrete_buffer_host_migration.cpp)
reuses the conformance test source under UR_LOADER_USE_LEVEL_ZERO_V2.

Signed-off-by: Lukasz Dorau <lukasz.dorau@intel.com>
The test was intermittently failing on CI hardware because the queue
create + USM fill + urQueueFinish sequence before the memory measurement
introduced a multi-millisecond time window. During that window, async
driver cleanup from earlier P2P tests (which can fail to evict peer
residency via zeContextEvictMemory) or concurrent GPU workloads on
shared CI machines could change devices[1]'s GLOBAL_MEM_FREE reading
enough to trigger the assertion.

The queue/fill/finish operations are not needed to test the residency
property: zeContextMakeMemoryResident is invoked at urUSMDeviceAlloc
time, so measuring immediately after the allocation captures any
peer-residency side-effects without a blocking GPU operation in between.
Remove those operations to keep the measurement window as short as
possible, matching the pattern already used in
allocationInitiallyAbsentOnPeer.

Signed-off-by: Lukasz Dorau <lukasz.dorau@intel.com>
@mateuszpn @pbalcer re-review please
When a buffer on a discrete GPU needs to be accessed from a different
device and P2P access is not enabled, migrate the data through a USM
HOST staging buffer instead of returning UR_RESULT_ERROR_UNSUPPORTED_FEATURE.
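The migration uses a two-step copy:
1. Synchronous device->host copy using the source device's own command
   list (the destination device cannot reach source device memory
   without P2P).
2. Async host->device copy enqueued on the caller's command list (host
   memory is accessible by all devices, so this is safe).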
Before the device->host copy, any pending operations on the caller's
command list are ordered and drained via zeCommandListAppendWaitOnEvents
+ zeCommandListHostSynchronize, ensuring prior kernel writes to the
source buffer are visible. A fully synchronous fallback is used when
no command list is available (e.g. urMemGetNativeHandle).
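
To illustrate the two-step copy and the drain-before-copy ordering, here is a minimal sketch that does not use Level Zero. DeviceMem, MockCommandList and migrateThroughHost are hypothetical names made up for this example, and the mock list simply defers work until the host synchronizes; the adapter code in this PR may be structured differently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <functional>
#include <queue>
#include <vector>

// Hypothetical stand-in for a device allocation; not a Level Zero type.
struct DeviceMem {
    std::vector<char> bytes;
};

// Hypothetical stand-in for a command list: work is queued and only runs
// when the host synchronizes, which mimics asynchronous execution.
class MockCommandList {
public:
    void appendCopy(char *dst, const char *src, std::size_t n) {
        pending_.push([dst, src, n] { std::memcpy(dst, src, n); });
    }
    void hostSynchronize() {
        while (!pending_.empty()) {
            pending_.front()();
            pending_.pop();
        }
    }
    std::size_t pendingCount() const { return pending_.size(); }

private:
    std::queue<std::function<void()>> pending_;
};

// Step 1: synchronous device->host copy on the source device's own list.
// Step 2: asynchronous host->device copy appended to the caller's list.
// The staging buffer must outlive step 2, so the caller owns it.
void migrateThroughHost(DeviceMem &src, DeviceMem &dst,
                        std::vector<char> &staging,
                        MockCommandList &srcList,
                        MockCommandList &callerList) {
    // Drain prior work on the caller's list so earlier writes to src
    // are visible before the device->host copy starts.
    callerList.hostSynchronize();

    staging.resize(src.bytes.size());
    dst.bytes.resize(src.bytes.size());

    srcList.appendCopy(staging.data(), src.bytes.data(), src.bytes.size());
    srcList.hostSynchronize(); // step 1 has completed once this returns

    // Step 2 stays pending until the caller's list is executed.
    callerList.appendCopy(dst.bytes.data(), staging.data(), staging.size());
}
```

After migrateThroughHost returns in this sketch, the staging buffer already holds the source data, while the destination is only filled once the caller's list is synchronized. That is why the staging memory has to stay alive past the call.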
Only one staging buffer is kept alive at a time: it is released at the
start of the next migration after zeCommandListHostSynchronize confirms
the previous async copy has completed.
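
The staging-buffer lifetime rule can be sketched as follows. This is an assumption-laden illustration: DeferredList and StagingOwner are hypothetical names, plain heap memory stands in for the USM HOST allocation, and draining the deferred list stands in for zeCommandListHostSynchronize.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <vector>

// Hypothetical command list that defers work until hostSynchronize().
struct DeferredList {
    std::vector<std::function<void()>> pending;
    void hostSynchronize() {
        for (auto &op : pending)
            op();
        pending.clear();
    }
};

// Keeps at most one staging buffer alive. The previous buffer is released
// only at the start of the next migration, after the list has been drained,
// so a pending async copy never reads freed memory.
class StagingOwner {
public:
    char *beginMigration(DeferredList &list, std::size_t size) {
        list.hostSynchronize(); // any previous async copy is now complete
        staging_.reset();       // safe to release the old staging buffer
        staging_ = std::make_unique<char[]>(size);
        return staging_.get();
    }
    bool hasStaging() const { return staging_ != nullptr; }

private:
    std::unique_ptr<char[]> staging_;
};
```

The trade-off in this sketch is that the last staging buffer stays allocated until the next migration (or until the owner is destroyed), in exchange for not having to track completion of the async copy separately.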
A new ensureDeviceAlloc helper allocates the destination device buffer
without the activeAllocationDevice side-effect of allocateOnDevice,
so the active-device state is only updated after the async copy is
successfully enqueued.
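
A rough sketch of why the allocation is separated from the active-device update is shown below. BufferSketch, DeviceId, migrateTo and the enqueueCopy callback are hypothetical and exist only for this illustration; only the names ensureDeviceAlloc and allocateOnDevice come from the description above, and their real signatures are not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <stdexcept>
#include <vector>

using DeviceId = int; // hypothetical handle type for illustration

class BufferSketch {
public:
    explicit BufferSketch(std::size_t size) : size_(size) {}

    // Allocates on the device if needed; does not touch activeDevice_.
    std::vector<char> &ensureDeviceAlloc(DeviceId dev) {
        auto it = allocs_.find(dev);
        if (it == allocs_.end())
            it = allocs_.emplace(dev, std::vector<char>(size_)).first;
        return it->second;
    }

    // Allocates and immediately marks the device active: the side effect
    // the migration path avoids.
    std::vector<char> &allocateOnDevice(DeviceId dev) {
        auto &mem = ensureDeviceAlloc(dev);
        activeDevice_ = dev;
        return mem;
    }

    // enqueueCopy is a hypothetical callback that may throw on failure.
    template <typename EnqueueFn>
    void migrateTo(DeviceId dev, EnqueueFn enqueueCopy) {
        auto &dstMem = ensureDeviceAlloc(dev);
        enqueueCopy(dstMem); // if this throws, activeDevice_ is unchanged
        activeDevice_ = dev; // commit only after a successful enqueue
    }

    DeviceId activeDevice() const { return activeDevice_; }
    bool hasAlloc(DeviceId dev) const { return allocs_.count(dev) != 0; }

private:
    std::size_t size_;
    DeviceId activeDevice_ = 0;
    std::map<DeviceId, std::vector<char>> allocs_;
};
```

In this sketch a failed enqueue leaves the destination allocation in place but keeps the buffer pointing at the old device, so a later access still reads from the copy that is known to be valid.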
Fixes: #22007
Fixes: #22008